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Abstract. We present a coherent Bayesian framework for selection of the most 
likely model from the five genetic models (genotypic, additive, dominant, co¬ 
dominant, and recessive) commonly used in genetic association studies. The ap¬ 
proach uses a polynomial parameterization of genetic data to simultaneously fit 
the five models and save computations. We provide a closed-form expression of 
the marginal likelihood for normally distributed data, and evaluate the perfor¬ 
mance of the proposed method and existing method through simulated and real 
genome-wide data sets. 
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1 Introduction 

Genome-wide association studies have been a popular approach to discover genetic 
variants that are associated with increased risk for rare and common diseases (Sebastiani 
et al. (2009)). The most common variants in the human genome are single nucleotide 
polymorphisms (SNPs): DNA bases that can vary across individuals. Typically SNPs 
have two alleles, say A and B, and based on the combination of SNPs alleles in each 
chromosome pair (the genotype), an individual can be homozygous for the A allele if 
both chromosomes carry the allele A, homozygous for the allele B if both chromosomes 
carry the B allele, and heterozygous when the two chromosomes carry the A and B 
alleles. Genotyping DNA was a slow and expensive process until mid-2000, when high 
throughput technologies produced microarrays that can generate the genetic profiles of 
an individual in hundreds of thousands to millions of SNPs, and the technology was 
the trigger for an explosion of genome-wide association studies (GWAS) to discover the 
genetic base of common diseases. 

Typically in a GWAS the association between each SNP and a quantitative trait is 
tested using linear regression under a specific genetic model that can assume a geno¬ 
typic (2 degrees of freedom), dominant, recessive, co-dominant, or additive mode of 
inheritance of each tested SNP. In a genotypic model the 3 genotypes AA, AB and BB 
are treated as a factor with 3 levels. The other 4 genetic models compress the 3 geno¬ 
types into a numerical variable by either counting the number of minor alleles (additive 
model), or by recoding the genotypes as AA=0 versus AB, BB=1 (dominant model for 
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the B allele), AA, AB=0 versus BB=1 (recessive model for the B allele), AA, BB=0 
versus AB=1 (co-dominant model). However, the inheritance pattern is rarely known, 
and using a suboptimal model can lead to a loss of power (Lettre et al. (2007)). 

Selecting the correct genetic model for each SNP is often accomplished by fitting 
the five models and choosing the model that describes the data best. This approach 
has several drawbacks. It increases computational burden with genome-wide data as 5 
GWASs need to be conducted. Furthermore, testing five models for each SNP increases 
the burden of multiple testing in addition to the existing issue of multiple comparisons 
with millions of SNPs. More importantly, the optimal method for choosing the best 
model is not clear (Lettre et al. (2007)). The common practice is to simply use the 
additive genetic model. It has been shown that additive models perform reasonably 
well to detect variants that have additive or dominant inheritance pattern, but they are 
underpowered when the true mode of inheritance is recessive (Bush and Moore (2012)). 
Others (Freidlin et al. (2002); Gonzalez et al. (2008); Li et al. (2008); So and Sham 
(2011)) have proposed to study the maximum of the three test statistics derived under 
additive, dominant, and recessive models. 

We propose a polynomial parameterization of the genetic data that includes the 
five genetic models as special cases, and we describe a coherent Bayesian framework to 
select the most likely genetic model given the data. This polynomial parameterization 
is equivalent to the genetic model described in Servin and Stephens (2007) that adds 
a dominance effect to the additive model to describe non-additive genetic effects. The 
advantage of either parameterization is that, in a Bayesian framework, fitting a single 
model becomes sufficient to test the genotype-phenotype associations without specifying 
a particular genetic model and this problem has been described in detail in Servin and 
Stephens (2007). Here, we focus on the specific task of selection of the best genetic 
model when the specific mode of inheritance is of interest in addition to whether a SNP 
is associated with the trait. 

The next section describes this parameterization and shows that there is a mathe¬ 
matical relationship between the parameters of the polynomial model and each of the 
five possible genetic models. Section 3 describes the model selection approach that is 
based on the computation of the marginal likelihood of the five models so that the 
model with maximum posterior probability can be identified. Section 3 also provides 
closed form solutions for the marginal likelihood and for the estimates of the parame¬ 
ters of the model with the highest marginal likelihood or Bayes Factor (BF), assuming 
exchangeable observations that follow normal distributions. The proposed method is 
evaluated through simulation studies in Section 4, and is applied to two GWAS data 
sets in Section 5. Gonclusions and suggestions for further work are provided in Section 6. 


2 Relationship Between the Polynomial Model and 
Other Genetic Models 

Here we show that the five genetic models are specific cases of a general polynomial 
model, with parameters that satisfy some linear constraints. Let y denote the response 
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variable in the genetic association study, and consider the polynomial regression model 

E{y\l3) = /3o + l3lXadd + ^2xldd 

where /3 denotes the vector of regression parameters and Xadd is the variable that codes 
for the genotype data as follows: 

{ 0 if genotype is AA 

1 if genotype is AB 

2 if genotype is BB. 

Note that the proposed model is equivalent to the additive model with dominance effects 
described in Servin and Stephens (2007): 

= ^0 + OlXadd + (^2Xhet 

where Xhet = 1 for heterozygous genotype and 0 otherwise, and 9i = P + 2132', 02 = —P 2 - 
Mathematically, we found the polynomial parameterization more appealing as it allows 
interpretation of the regression coefficients in terms of the SNP dosage. 

2.1 Genotypic Association Model 

The genotypic association model is typically parameterized using two indicator variables 
to describe the effect of the genotypes AB and BB relative to AA: 

E{y\l) = 70+ llXAB + I 2 XBB 

XAB = 1 if genotype is AB (and 0 otherwise) 

and = 1 if genotype is BB (and 0 otherwise). 

This parameterization specifies the expected value of y, for each of the 3 genotypes 
AA, AB, BB as summarized in Table 1. Equating the expected values of y from the 2 
different parameterizations produces a system of linear equations: 


7o 


Po 

7o + 7i 

= 

Po + Pi + P 2 

_ 7o + 72 _ 


Po + 2/3i + 4/?2 


that can be solved as 


70 


■ 1 0 0 ■ 


Po 


■ 1 0 0 ■ 

7i 

= 

0 1 1 


Pi 

= 

0 1 1 

. 72 . 


0 2 4 


. P^ . 


0 2 4 


Therefore, the parameters in the polynomial model and the genotypic association model 
have a one-to-one relationship. For the other genetic models, some constraints on pa¬ 
rameters of the polynomial model are necessary. 
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2.2 Additive Model 

The parameterization of the additive genetic model is E{y\aA) = oao + ctA^add and 
equating the expected values of y in Table 1 leads to the system of linear equations: 

/3o = OiAO 
Pi + P2 = a A 
2Pi + 4/32 = 2a^ 

that can be solved if /32 = 0, so that /3o = oao, and /3i = a a- Therefore the relationship 
between the parameters in the polynomial model and additive model requires a linear 
constraint on the vector /3. 

2.3 Dominant Model 

Now, consider the dominant model for the B allele: E{y\aD) = auo + otoxoam, where 
XDom=l if genotype is AB or BB (and 0 otherwise). Proceeding as in the previous cases 
leads to the equations: 

OiDO = Po 

ocD = PiE P 2 = 2/3i + 4/32. 

The system has the solution aon = Po and ao = §/?i if parameters of the polynomial 
model satisfy the constraint Pi + 3/32 = 0. 

2.4 Recessive Model 

Similarly, consider the recessive model for the B allele: E{y\aii) = otm + aRXRec, 
where xiiec=l if genotype is BB (and 0 otherwise). In this case, the relations between 
parameters are: 

am = Po = Po + Pi + P 2 
aR = 2/3i + 4/32 

with linear constraint Pi + P 2 = Q, cirq = /3o and or = 2/3i. 


2.5 Co-dominant Model 

Lastly, consider the co-dominant genetic model: E{y\ac) = aco + acxcod, where 
xcod=l if genotype is AB (and 0 otherwise). In this case: 

aco = Po = Po + 2/3i -f 4/32 

ac = Pi+ P2- 

The linear constraint is Pi + 2/32 = 0, so that aco = Po and ac = ^Pi- 

In summary, there is a one-to-one transformation between the parameters of the 
polynomial and general model, while the transformation between the polynomial and 
the other four models (additive, dominant, recessive, and co-dominant) is constrained 
by a linear contrast of the parameters in the polynomial model. 
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Polynomial 

Model 

2 -df 

General 

Model 

Additive 

Model 

Dominant 

Model 

Recessive 

Model 

Co- 

dominant 

Model 

E{y\AA) 

do 

70 

OtAO 

OLDO 

01 m 

“CO 

E(y\AB) 

do + dl + d 2 

70 + 71 

OtAO + OlA 

Odd + “D 

«H0 

“CO + “C 

E{y\BB) 

do + 2dl+4d2 

70 + 72 

OiAO + 2aA 

OiDO + “D 


“CO 


Table 1: Expected value of the quantitative trait for 3 genotypes in each model. 


3 Model Selection via Marginal Likelihood and 
Parameter Estimation 

The polynomial parameterization provides a framework to simultaneously fit different 
genetic models. Given a sample of genotype data, the question is how to select the most 
appropriate genetic model. We propose a Bayesian model selection approach in which 
genetic models are compared based on their marginal likelihood and the model with 
largest marginal likelihood is selected, assuming that a priori the 5 genetic models are 
equally likely. 

In the polynomial model (y|/3 = Xf3 + e in matrix form), the data are assumed to 
be exchangeable and follow a normal distribution with: 

y\X,P,Tr^NiXP,-I) 

T 

where I is the identity matrix. A standard normal-gamma prior for the vector of pa¬ 
rameters /? and precision r is assumed such that p(/3,r) = p{(3\t)p{t), where 

T ~ Gamma{ai,a2) 

/3|r - N{i3o,{tRo)-^) 

with Pq, Rq, tti, and 02 as prior hyperparameters. Specification of these prior hyper¬ 
parameters can be subjective and represents the prior probability of alternative genetic 
models. With genome-wide data, most of the tested SNPs are likely to be null SNPs 
and it is both reasonable and convenient to assume non-informative priors. Therefore 
the following values for the prior hyper-parameters: Po = 0, Rq = I, ai = 1, and 
02 = 1 will be assumed. If there is strong prior belief about certain genetic models, 
more informative prior distributions can be chosen and this problem is described at 
length in Servin and Stephens (2007). The marginal likelihood given this polynomial 
model Mp can be computed analytically in the equation below: 

^ f t o \ to 1 a2n"r(ain) l^ol^ 

p{y\Mp) = / p{y\X,P,T)p{P,T)dpdT = 

J (27r)2 02 ^( 01 ) |i?„|2 

with the following updated hyper-parameters: 


= i?o -H X'^X 
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l3n = R-HRo^o + X^y) 
n 

Oln = fll + TT 


a2n — [ 


-PlRnPn + y y + Pq RqPo J_,_i 

2 02 ^ 


Details are for example in O’Hagan and Kendall (1994). In the general genetic model, 
the vector of parameters 7 is a linear transformation of /I, 7 = u;/3, where the matrix w 
is: 


1 0 0 
0 1 1 
0 2 4 


Since 7 is a linear transformation of /3, once a prior distribution for (3 is elicited, the 
prior distribntion of 7 is derived as: 


7|r = w/?|t - iV(a;/3o,w(Ti?o) 


while the prior for r does not change with the re-parameterization. If the prior distri¬ 
butions of the parameters vectors are so defined, then it can be shown that p{y\Mp) = 
p{y\Mo) (see the Supplementary Material for details). In other words, the marginal 
likelihood is invariant under linear transformations of the regression coefficients. 


Derivation of marginal likelihoods for additive, dominant, recessive, and co-dominant 
models is different, as these models are defined by a linear transformation of the parame- 


ters of the polynomial model and an additional constraint. Formally, let a = 
note the vector of parameters in any of these models. Then we can define a = 


ai 

ao 

0.1 


de- 


00 

01 


|02 = 0 where 6 = 


00 

01 

02 


= w/? and matrix w depends on the specific genetic 


model (see Table 2). If the vector (3 follows a multivariate normal distribution, 9 also 
follows a multivariate normal distribution, and so does the marginal distribution of 02 
and a that is a conditional distribution. Starting from the proper prior distributions for 
the vector of parameters /3 and precision r priors, then proper prior distributions for a 
and r are found to be: 

r Gamma{ai, 02 ) 


a = 


OQ 


■ 00 ■ 

Ol 


_1 


|02 = O^iV(/io,r-'Eo^) 


go and can be obtained by using properties of the conditional multivariate normal 
distribution (Eaton (1983)) and are summarized in Table 3. Using these derived priors, 
the marginal likelihood for the additive, dominant, recessive, and co-dominant models 
{Ma, Md, Mh, and Me, respectively) can be computed in closed form. The derivation 
of the marginal likelihood for the dominant model is detailed in the Supplementary Ma¬ 
terial. Derivation of the marginal likelihood for the additive, recessive, and co-dominant 
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2-df General 
Model 

Dominant 

Model 

Recessive 

Model 

Codomiant 

Model 

Additive 

Model 

U) 


1 0 0 

0 1 1 

0 2 4 



1 0 0 

0 1 1 

0 1 3 



1 0 0 

0 2 4 

0 1 1 



1 0 0 

0 1 1 

0 1 2 



1 0 0 

0 1 1 

0 0 1 



Table 2: Specification of in for five genetic models. 


model is similar. Note that the derivation relies on the use of proper prior distributions 
for the parameters of the polynomial model. 

Assuming that the 5 genetic models are a priori equally likely, the Bayes rule to 
model selection is equivalent to choosing the genetic model with the highest marginal 
likelihood or BF relative to the null model (i.e. ratio of marginal likelihood of one of 
the five models and the null model) (Kass and Raftery (1995)). Once the most likely 
model is selected, the parameter estimates of any of the five genetic models are the 
means of the posterior distributions. The regression parameters in the polynomial model 
are estimated by /3„ = + X'^y) and using the one-to-one relationship, the 

parameters in the general model can be estimated by 7 n = ujf3n- The relation between 
parameters of the polynomial models and the dominant, recessive, co-dominant, and 
additive models can be used to derive their posterior estimates. Specifically, from the 
set of relations: 

/3|r ~ N{l3n, (ri?„)“^) 

9 = ojP\t ^ N{9n = nj3n,n{TRn)~^uF) 

|02 = O-iV(/r„,r-iE-i) 

and using the properties of the conditional multivariate normal distribution, the point 
estimates for any model are found to be: 

+ [<S'l2][<S'22] ^[0 — 9n2] 


l^n — 


^nO 



where uj{tR^ 




^-1 


and dim{Sii) = 2x2, dim{Si 2 ) = 2x1, 


iSll >S'l2 

>S'21 S 22 

dim{S 2 i) = 1x2, and dim{S 22 ) = 1x1. Table 3 summarizes the specification of w and 
formulas for computing prior and updated hyper-parameters and marginal likelihood 
for different genetic models discussed in this section as well as the null model. 


4 Simulation Studies 

Three simulation studies were conducted to assess false and true positive rates of the 
Bayesian procedure with polynomial models and compared to the frequentist approach 
in which the association with minimum p-value is selected. Simulation study (1) was de¬ 
signed to evaluate the false positive rates of the polynomial model approach for various 


























60 


Polynomial Coding of Genetic Data 


Prior Hyper 
parameters 


Posterior Hyper 
parameters 


R-n = Ro + 

ain = 01 + r./2 

' -fll Rnfin+V^y 


Marginal 

Likelihood 


00 = [/^oo 001 002 ]'^ 


Rq = 


^22 

^32 


’^23 

^33 


^ 2 n = I 
00 Ro0o 




I fin 


1/2 


(27r)’^/2 |fi~ |l/2 


70 = t^/3Q 

= a;(Ro)“^^ 


7n = 

«ln = «1 + ri/2 

'-0'^Rn0n+y'^y 


00 Ro0o , 


p(y|Mc-) = -crri;-- 

“2 ^ '•“ 1 .' 

( 2 , r )'*/2 |(^- 1 ) Tr „„- 1 | 1/2 


p(Kl^fi) = m -r 

«2 ^(“ 1 j 




p(p-Ro) 


«0 
«0 
«0 
= So = 




«13 

S 3 


*0 

^00 

^01 

(«B^) 

oil 


®02 


s31 

Pn = 
gl3 
= 23 


s32 

®n 0 


(^n )" 


s23 

o 33 


3 I 3 

c 23 


|SqI 


1/2 


(27r)’^/2 is” |l/2 

“ 2 n^ I’C^ln) 

where i= dominant, recessive 
codominant or additive 


(g33)-l(g31 s32) 


“In = “1 + “/2 

' -pTSnPn+y"^ 

—- 


P^ ^OPO , 1 ^ 

3 “2 j 


-1 


. = H-^SoSo + 1 -" P) 
Rn = Rq + I'^y 

“In = “1 + n/2 

' -S^HnSn+y^y 


So = [Soo] 
P^o = (Pll) 


P(y\^nul0 = 

\Rn 


ir(ci) 

1/2 


So -RqSq ^ ^ 


(2n)P/3 |H„| 1/2 
“ 2 n"r(“ln) 


Table 3: Specification prior hyper-parameters, updated hyper-parameters, and 
marginal likelihood for each model. Mp=Polynomial Model, MG=Genotypic Model, 
Mp)=Dominant Model, Mp=Recessive Model, Mc=Codominant Model, M^=Additive 
Model, and MAr=Null Model. 


selection criteria. Real genotype data from two GWASs of different sample sizes were 
used and the quantitative trait in each set was randomly permuted to create data sets 
with no true positive associations. Simulation study (2) was designed to compare sensi¬ 
tivity and specificity of our proposed method and the standard approach by simulating 
genetic data that mimic the GWAS setting with causal SNPs (i.e. SNPs truly asso¬ 
ciated with the trait) having different modes of inheritance, each SNP explaining the 
same proportion of the trait variability. Simulation (3) modified the design of simulation 
(2) and let SNPs explain varying proportions of the trait variability. 
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Data : Two real datasets were used. The first data set consisted of genotype data of 201 
unrelated offspring of centenarians from the New England Centenarian Study (NECS) 
(http://www.bumc.bu.edu/centenarian/) (Sebastiani and Peris (2012)). The geno¬ 
type data were described in Sebastiani et al. (2012). The quantitative trait in this 
analysis was a neuroticism score measured in the NEO-Five Factor Inventory (NEO- 
FFI), which is a 60-item (12 items per domain) measure of five personality dimensions 
(neuroticism, extraversion, openness, agreeableness, and conscientiousness) (Costa and 
McCrae (1992)). Previous studies have shown that the estimated heritability of neuroti¬ 
cism is approximately 25% (Bae et al. (2013); Pilia et al. (2006)). The second data set 
consisted of 843 unrelated African-American subjects with sickle cell anemia enrolled 
in the Cooperative Study of Sickle Cell Disease (CSSCD) (https ://biolincc .nhlbi . 
nih.\gov/studies/csscd/) (Gaston and Rosse (1982)). In this cohort, the trait is the 
percent of fetal hemoglobin in the total hemoglobin. The percent fetal hemoglobin is a 
major modulator of hematologic and clinical complications of sickle cell anemia (Akin- 
sheye et al. (2011)). Studies have shown that there is a strong genetic basis of fetal 
hemoglobin and a well-established gene that affects this trait is BCLllA (Bae et al. 
(2012)). The estimated heritability ranges from 60.9% to 89% (Garner et al. (2000); 
Pilia et al. (2006)). Both studies were approved by the institutional review board of 
each participating institution, and standard quality control procedures were performed 
on both genotype data (Bae et al. (2012); Sebastiani et al. (2012)). 

Methods : Initially 254,612 and 486,331 autosomal SNPs were available for analysis in 
the two cohorts (NECS and CSSCD), respectively. It is well known that SNPs in close 
proximity tend to be correlated with each other (Slatkin (2008)), and this non-random 
correlation can bias the assessment of false positive rates. In order to avoid this problem, 
SNPs whose pairwise correlation was > 0.2 were removed using the PLINK software 
(Purcell et al. (2007)). After this pruning, 50,894 and 140,864 independent SNPs were 
left for analysis in the NECS and CSSCD sample, respectively. In both sets, 50,000 
SNPs were randomly chosen from each set and 10,000 simulations were performed by 
permuting the trait values at random. Two approaches were evaluated in this simulation 
study: 1) the proposed method, in which the best genetic model was selected based on 
the maximum BE for each SNP; and 2) the frequentist approach, in which five genetic 
models were fitted and the best model was selected based on the minimum p-value for 
each SNP. For the genotypic model (2 degrees of freedom) in the frequentist approach, 
the minimum of the two p-values was chosen. Various threshold criteria for the two 
approaches were explored and the number of significant associations detected for varying 
thresholds was recorded. False positive rates were computed as the rates of significant 
associations. 

4.2 Simulation Study (2) 

Data : In order to assess the true positive rates of our proposed method and the stan¬ 
dard approach, genetic data were simulated with known causal SNPs, each explaining 
the same proportion of the trait variance. A modification of the simulation procedure 





62 


Polynomial Coding of Genetic Data 


described in Yip and Lange (2011) was used to simulate the data but an additional 
source of variability was introduced as well as SNPs with dominant and recessive mode 
of inheritance, in addition to additive effects. Several scenarios were considered by us¬ 
ing different sample sizes (N=1,000, 10,000, 20,000, 50,000, and 100,000) and different 
heritability (Ober et al. (2001)) of the quantitative traits (/i^=0.2, 0.4, and 0.6). Heri- 
tability is defined as the proportion of the total variance of the trait that is explained by 
the genetic effect and the higher the heritability the larger the genetic contribution to 
the trait. A total of 500,000 SNPs were simulated in each run and included 100 causal 
SNPs: 34 with additive effects, 33 with dominant effects, and 33 with recessive effects. 
We assumed that each causal SNP explained 1/100 of the total genetic variance so 
that, for example, when the total heritability was 0.2 and 20% of the total phenotypic 
variance was due to the genetic variance, each causal SNP explained 1/100 of the total 
genetic variance and hence the SNP-specific heritability was 0.002. 

Methods : The following steps describe the simulation scheme. 

Step 1. Generate minor allele frequency for each SNP 

The minor allele frequency (MAP: frequency of B allele) for each SNP was randomly 
drawn from a Beta distribution Beta{2,8 ), which represents the distribution of commer¬ 
cially available chips (Yip and Lange (2011)). We also used a standard quality control 
procedure by excluding any SNPs with MAF less than 0.01. 

Step 2. Generate the genotype 

Genotypes for each SNP (AA, AB, BB) were generated assuming Hardy-Weinberg 
equilibrium. Essentially, if p is the prevalence of the A allele in the population, Hardy- 
Weinberg equilibrium (HWE) law states that the prevalence of the three genotypes will 
be p^, 2p(l —p), (1 —pY (Weinberg (1908)). These expected genotype frequencies were 
used to simulate the genotype data, given p. 

Step 3. Select the causal SNPs 

100 causal SNPs from the total SNPs were randomly chosen and assigned the mode 
of inheritance randomly to the selected SNPs. 

Step 4- Determine the effect size for each causal SNP 

The effect size Oj for each causal SNP (j=l, 2,..., 100) was computed from the 
formula: 

u2 _ ^Add,j + ^Dom,j _ 2pj(l — Pj)[aj + dj{l — 2pj)]^ + ]f2pj(\ — Pj)djY 
^Total ^Totals 

where cr^cid j additive genetic variance of the causal SNP, ^ is the dom¬ 
inance genetic variance of the causal SNP, cr^otai total phenotypic variance, 

Pj is the MAF for the causal SNP, Oj is the additive genetic effect at the causal 
SNP, and dj is the dominance genetic effect at the causal SNP. The parameter h^ 

is the locus-specific heritability, which was assumed to be This is the amount of 
heritability that is contributed by the causal SNP and hence all causal SNPs con¬ 
tribute to the total heritability by an equal amount. In the above formula, note that the 
genetic variance is decomposed into the additive and dominance variance component. 
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The additive genetic variance implies that each additional copy of an allele contributes 
a fixed amount of effect aj to the trait. Under this assumption, the trait value of the 
heterozygote (AB) would be the midpoint between the two homozygotes (AA and BB). 
On the other hand, when there exists dominance genetic variance, the trait value of 
the heterozygote will deviate from the midpoint between the two homozygotes, and the 
degree of deviation is expressed by the quantity dj. Therefore, it follows that dj = 0 
for any SNP with additive effect, and dj = aj for any SNP with dominant effect, and 
dj = —Qj for any SNP with recessive effect. We also assumed (r'^otai = 1- Note that only 
the three genetic models (additive, dominant, and recessive) were considered. 

Step 5. Generate the phenotypic value based on the causal SNPs 

Let Pi denote the phenotypic value for individual. For each causal SNP, the SNP 
contribution to the trait was randomly generated as 

where aj is the effect of the causal SNP (computed in the previous step) and Gij 
is genotype coding for the individual at the causal SNP that was generated in 
Step 2. For an additive causal SNP, Gij= number of minor allele (0, 1, 2). For a dominant 
causal SNP, Gij = 1 if the genotype is AB or BB (0 otherwise). For a recessive causal 
SNP, Gij = 1 if the genotype is BB (0 otherwise). Then, the phenotypic value is: 

Vi = Xij 

3 

and E{yi) = a-nd Var{yi) = = 1. 

3 

Step 6. Perform association tests using our method and the standard approach 
Step 7. Repeat 100 times 

In each simulated data set, the empirical false positive rates in the two approaches 
were evaluated to determine thresholds for p-values and BF with the same false positive 
rates. Specifically, in each simulated set the number of false positive associations (sig¬ 
nificant associations of null SNPs) in the frequentist results with p-values p < 1 x 10“^, 
5 X 10“^, 1 X 10“®, and 5 x 10“® were counted and the BF thresholds that produced 
the same number of false positive associations in the Bayesian approach were detected. 
Using these p-values and BF thresholds that produced the same empirical false positive 
rates, the power of the two approaches was evaluated. Two types of power were consid¬ 
ered: (1) the number of causal SNPs detected as associated regardless of whether the 
correct genetic model was identified and (2) the number of causal SNPs detected with 
the true genetic model. 

4.3 Simulation Study (3) 

Data and Methods : The limitation of Simulation Study (2) is the assumption that each 
SNP accounts for the same proportion of the trait variability. Therefore, the scheme of 
the Simulation Study (2) was modified to let causal SNPs explain varying proportions 
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of the trait variability. In this modified design, the genetic variances of dominant and 
recessive SNPs were increased, while decreasing the genetic variance of additive SNPs. 
This trade-off was necessary to maintain the same total heritability used in Simulation 
Study (2) for proper comparison later. The following two cases were considered: 1) when 
h'j was halved for additive SNPs and 2) when h'j was quartered for additive SNPs. In 
case I), this resulted in increasing hj by 25% for both dominant and recessive SNPs. In 
case 2), this resulted in increasing by 37.5% for both dominant and recessive SNPs. 
Under these changes, the effect sizes were generated based on Step 4 in the previous 
section and the rest of the simulation design remained the same. 


a) Bayesian Polynomial Model Approach 



BF>100 

BF> 500 

BF>1000 

BF>1500 

BF>3000 

BF>5000 

NECS 

data 

9.0 X 10"^ 

1.4 X lO-'^ 

8.0 X lO"'^ 

4.0 X 10"^ 

2.0 X 10“^ 

0.0 X 10"" 

CSSCD 

data 

6.8 X 10"'‘ 

1.2 X 10"'‘ 

6.0 X lO"'" 

4.0 X lO"'" 

2.0 X lO”'" 

0.0 X lO"*^ 


b) Freqnentist Approach 



p < 10-" 

p < lO"'* 

p < lo-'" 

p < lo-*" 

p < 10"' 

p < 10”“ 

NECS 

data 

3.4 X 10”^ 

4.0 X lO"'" 

4.0 X lO"'" 

0.0 X lO”'" 

0.0 X 10““ 

0.0 X lO"*" 

CSSCD 

data 

3.3 X 10"" 

3.8 X lO”'" 

4.0 X 10"^ 

0.0 X lO""* 

0.0 X 10““ 

0.0 X 10"^" 


Table 4: Median false positive rates in the NECS and CSSCD data in 10000 permutations 
(Simulation Study 1). BF=Bayes Factor; p=p-value. 


Results : Table 4 shows the median false positive rates at varying significance thresholds 
in the two sets included in Simulation Study (1). Setting the thresholds to maximum 
BF > 1500 in our approach and minimum p-value < 10“® in the standard approach 
yields the same median false positive rate of 4 x 10“® in both data sets. This translates 
into 2 false positive associations among 50,000 SNPs. 

Figures 1-3 summarize the results of Simulation Study (2) for the scenario in which 
the total heritability is 0.4 and the sample sizes are 10,000, 20,000, and 50,000. The full 
set of results can be found in the Supplementary Materials. With a sample of 1000, nei¬ 
ther approach detects any causal SNPs, while almost all causal SNPs are detected when 
the sample size is 100,000, regardless of the heritability. Figure 1 shows the distribution 
of the empirical false positive rate for different p-value thresholds and Figure 2 shows 
the distribution of the BF that would produce the same empirical false positive rates of 
the frequentist procedure. Figure 3 shows the box plot of the true positive rate (propor¬ 
tion of detected causal SNPs) of the two approaches at varying significance thresholds. 
Finally, Table 5 shows the mean number of additive, dominant, and recessive SNPs that 
are correctly identified (out of 34, 33, and 33, respectively) in the two approaches at 
successive thresholds. 

The mean and standard deviation of the quantitative traits were 2.55 and 1.10 when 
the total heritability was 0.4. Several points are noteworthy. The first point is that, as 
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a) h*=0.4, N=10,000 



b) h*=0.4, N=20,000 c) h*=0.4, N=50,000 




Figure 1: Box plot of cubic root of empirical false positive rates (y-axis) for different 
p-value thresholds (x axis), and increasing sample sizes. The results are based on the 
simulation scenario when the heritability was 0.4 and the sample sizes were 10,000, 
20,000, and 50,000 (panel a, b, and c, respectively). 


a) h‘=0.4, N=10,000 


b) h*=0.4, N=20,000 


c) h*=0.4, N=50,000 


Cl 

CD 


I 



P value Thresholds 



P value Thresholds 



P value Thresholds 


Figure 2: Box plot of log-transformed BF (y-axis) for different p-values thresholds (x- 
axis) and increasing sample sizes. The results are based on the simulation scenario when 
the heritability estimate was 0.4 and the sample sizes were 10,000, 20,000, and 50,000 
(panel a, b, and c respectively). 


expected, lower heritability of the trait results in a smaller number of detections. Even 
when the total heritability was relatively high (/i^=0.6), both approaches detected about 
half of the causal SNPs with N=10,000. At the most stringent significance threshold of 
p < 1 X 10“^, the Bayesian approach correctly identified 53.75 causal SNPs on average 
and the frequentist approach correctly identified 46.52 causal SNPs on average when 
heritability is 0.6 and N is 20,000 (see Supplementary Materials for detail). This result 
is consistent with findings from other authors that large sample sizes are needed to 
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a) h*=0.4, N=10,000 


b) h*=0.4, N=20,000 


c) h^=0.4, N=50,000 


Q> 

$ 

O 

Cl 



P-value thresholds 



P-value thresholds 



P-value thresholds 


Bayesian 

FrequentisI 


Figure 3: Box plot of power of the two approaches at different significance thresholds 
(Red: Bayesian approach; Blue: Frequentist approach). The results are based on the 
simulation scenario when the heritability estimate was 0.4 and the sample sizes were 
10,000, 20,000, and 50,000 (panel a, b, and c, respectively). 


detect many casual variants that explain a small proportion of variability. For example, 
in Park et al. (2010) authors have shown that they need approximately N=25,000 to 
detect 25 loci out of 201 causal variants with 80% power for a highly heritable trait. 

Secondly, we observed that the false positive rate of decision rules based on BF 
decreases as the sample size increases, given a fixed BF threshold. This property has 
also been noted in Matthews (2011) and Wakefield (2012) and implies that we can relax 
the thresholds for BF as we increase the sample size and better leverage the increased 
sample size than frequentist procedures. Figure 2 illustrates this property graphically. 
For example, at a fixed p-value threshold 1 x 10“^, the median BF threshold needed to 
obtain the same false positive rate decreases from 11122 to 9866 to 7489 as the sample 
size increases from 10,000 to 20,000 to 50,000. A similar pattern is observed at all 
levels of false positive rates. In contrast, no such pattern is observed in the frequentist 
approach, and the false positive rates are invariant to sample sizes given a fixed p-value 
threshold in the standard approach. 

The third important point is that the Bayesian method we propose has a slightly 
greater power for more stringent (lower) thresholds (see Figure 3) than the frequentist 
approach. This result holds for all sample sizes and all levels of heritability considered in 
the simulations (see Supplement Figures S3 and S6). Also, at this stringent threshold, 
the Bayesian approach recovered more correct genetic models when the sample sizes 
were 20,000 and 50,000 (Table 5). Although our method recovers less often than the 
frequentist approach SNPs with an additive genetic effect, it identifies more often SNPs 
with a dominant and recessive effect. When the sample size was 10,000, the Bayesian 
approach recovered slightly fewer genetic models. In addition, both approaches iden¬ 
tified nearly 0 models that had either dominant or recessive inheritance pattern when 
N=10,000 in simulation study (2). We speculated that this may be due to the lack of 
power to detect rare variants. For example, if we assume MAF=0.01, under HWE the 






H. Bae, T. Peris, M. Steinberg, and P. Sebastiani 


67 


Bayesian Frequentist 



N 

Significance 

Threshold 

A 

D 

R 

Total 

A 

D 

R 

Total 



1 X 10“^ 

15.6 

0.7 

0.9 

17.2 

17.3 

0.3 

0.3 

17.9 


10000 

5 X 10-’^ 

16.0 

0.8 

1.0 

17.7 

19.8 

0.6 

0.6 

21.0 


1 X 10“® 

16.3 

0.9 

1.2 

18.4 

20.8 

0.7 

0.8 

22.3 



5 X 10“® 

17.5 

1.3 

1.9 

20.7 

22.4 

1.1 

1.6 

25.2 

Simulation 


1 X 10-' 

25.0 

4.9 

6.4 

36.3 

28.2 

2.4 

3.5 

34.2 

Study (2) 

20000 

5 X 10"’^ 

25.0 

5.6 

7.5 

38.0 

28.2 

4.1 

5.4 

37.7 

Uniform 

1 X 10"® 

25.0 

6.2 

8.1 

39.3 

28.2 

4.9 

6.6 

39.7 

Contribution 


5 X 10“® 

25.0 

00 

bo 

11.0 

44.8 

28.3 

7.5 

10.1 

45.9 



1 X 10“' 

30.0 

27.1 

30.5 

87.6 

31.2 

23.2 

28.6 

82.9 


50000 

5 X 10“’^ 

30.0 

27.6 

31.1 

88.8 

31.2 

24.4 

30.3 

85.8 


1 X 10“® 

30.0 

27.8 

31.4 

89.3 

31.2 

24.6 

31.0 

86.7 



5 X 10“® 

30.0 

28.5 

32.0 

90.5 

31.2 

25.4 

31.9 

88.5 



1 X 10“^ 

3.5 

1.4 

1.9 

6.8 

3.2 

0.7 

0.8 

4.7 


10000 

5 X 10“’^ 

3.8 

1.5 

2.1 

7.4 

4.9 

1.0 

1.4 

7.3 


1 X 10“® 

4.1 

1.8 

2.4 

8.3 

5.7 

1.3 

1.8 

8.9 



5 X 10“® 

5.5 

2.9 

3.7 

12.0 

8.5 

2.5 

3.2 

14.2 

Simulation 


1 X 10-' 

16.5 

10.4 

13.1 

40.0 

18.9 

6.3 

8.1 

33.2 

20000 

5 X 10“’^ 

17.0 

11.5 

14.5 

43.1 

21.2 

9.0 

11.5 

41.7 

Study (3) - 
Case 1) 

1 X 10-® 

17.4 

12.3 

15.4 

45.1 

21.9 

10.2 

13.3 

45.4 


5 X 10-® 

18.5 

14.7 

18.7 

51.9 

23.6 

12.9 

17.7 

54.2 



1 X 10-' 

26.3 

29.4 

32.7 

88.4 

29.0 

26.6 

32.2 

87.8 


50000 

5 X 10“’^ 

26.3 

29.4 

32.8 

88.5 

29.0 

26.8 

32.6 

88.4 


1 X 10“® 

26.3 

29.5 

32.8 

88.7 

29.0 

26.9 

32.7 

88.5 



5 X 10“® 

26.3 

29.5 

32.9 

88.8 

29.0 

26.9 

32.9 

88.8 



1 X 10-^ 

0.3 

2.0 

2.8 

5.1 

0.2 

0.9 

1.1 

2.2 


10000 

5 X 10"’^ 

0.3 

2.3 

3.1 

5.7 

0.5 

1.7 

2.1 

4.2 


1 X 10“® 

0.4 

2.6 

3.4 

6.3 

0.6 

2.1 

2.7 

5.5 



5 X 10“® 

0.7 

3.9 

5.0 

9.6 

1.2 

3.5 

4.5 

9.2 

Simulation 


1 X 10-' 

3.9 

13.5 

16.8 

34.2 

3.4 

00 

bo 

11.2 

23.5 

20000 

5 X 10-’^ 

4.4 

14.5 

18.0 

36.9 

5.3 

11.7 

15.2 

32.2 

Study (3) - 
Case 2) 



1 X 10 ® 

4.8 

15.2 

19.0 

39.0 

6.4 

12.7 

16.9 

35.9 


5 X 10“® 

6.3 

17.8 

22.3 

46.4 

9.3 

15.6 

21.4 

46.3 



1 X 10-' 

20.9 

29.7 

32.9 

83.5 

24.2 

27.4 

32.7 

84.3 


50000 

5 X 10"’^ 

21.3 

29.7 

32.9 

84.0 

25.3 

27.5 

32.9 

85.6 


1 X 10"® 

21.5 

29.7 

33.0 

84.2 

25.6 

27.5 

33.0 

86.0 



5 X 10“® 

21.8 

29.7 

33.0 

84.5 

26.1 

27.5 

33.0 

86.5 


Table 5: Mean number of additive, dominant, and recessive SNPs correctly identified 
(out of 34, 33, and 33, respectively) in the two approaches when heritability is 0.4. 
A=additive; D=dominant; R=recessive. 


expected count of the homozygote group for the minor allele is only 1. As the sample 
size increased, there was a substantial increase in identification of dominant and reces¬ 
sive variants. Results from simulation study (3) also support this conjecture. Increased 
effect sizes for SNPs with dominant and recessive effects resulted in more detection of 
these variants at the cost of loss of power for additive SNPs. However, loss of power 
for additive SNPs was much greater than increased power for dominant and recessive 
SNPs when the sample size was 10,000. This result suggests that we need much higher 
sample sizes to detect dominant and recessive variants, compared to the additive SNPs. 
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5 Application to Real Data 

Using the thresholds that yielded the same false positive rate in the two methods (max¬ 
imum BF>1500 and minimum p < 1 x 10“®) in simulation study (1), we compared the 
results obtained with the two methods in the cohorts described in the earlier section, 
using the SNP sets generated after pruning the dependent SNPs. In the NECS data, 
out of 50,894 tested SNPs, nine SNPs were found associated with neuroticism using the 
polynomial model approach, whereas only five SNPs were significant using the standard 
approach (Table 6). Four SNPs were common in both analyses. For these four SNPs, 
the genetic models selected agreed in the two approaches. This result suggests that the 
Bayesian model selection procedures work well in the case of small sample sizes and can 
potentially discover more variants. 

In the CSSCD data, out of 140,864 tested SNPs, ten SNPs were associated with 
fetal hemoglobin in both approaches, and eight SNPs were common in both (Table 6). 
For these eight SNPs in common, five of them agreed in the genetic model selection 
between the two approaches, but three SNPs (rs2239580, rsl2469604, and rs2034614) 
had discrepant results. For rs2239580, the Bayesian procedures selected the dominant 
model, while the standard approach identified the genotypic model. For rsl2469604, 
the Bayesian procedure selected the dominant model, while the additive model had the 
minimum p-value. For rs2034614, the co-dominant model had the maximum BF and 
the genotypic model had the minimum p-value. 

Using our Bayesian polynomial model approach in the NECS data, 4 SNPs had 
dominant models, 3 SNPs had additive models, 1 SNP had a co-dominant model and 1 
SNP had a recessive model. In the CSSCD data, 3 SNPs had dominant models, 3 SNPs 
had co-dominant models, 3 SNPs had recessive models, and I SNP had an additive 
model. These results suggest that different variants may influence the trait through 
different genetic models. Some of these associations would not have been captured if 
an additive model alone was used, and this highlights the need to examine all possible 
genetic models in a computationally efficient manner to ensure that we do not miss any 
interesting findings. 


6 Conclusion 

We propose a Bayesian approach to simultaneously detect the SNPs associated with a 
continuous trait and the mode of inheritance. Our Bayesian approach uses a polynomial 
parameterization of the SNP dosage that can simultaneously represent different genetic 
models and a coherent framework for model selection based on comparing different 
models by their posterior probability (The Wellcome Trust Case Control Consortium 
(2007); Marchini et al. (2007); Servin and Stephens (2007); Guan and Stephens (2008); 
Wakefield (2008); Newcombe et al. (2009); Stephens and Balding (2009); Clark et al. 
(2010); Lorenzo Bermejo et al. (2011); Mailer et al. (2012); Xu et al. (2012)). Cru¬ 
cial to our approach is the use of proper prior distributions on the parameters of the 
polynomial model, from which the prior distributions of specific genetic models can be 
derived. In contrast to this coherent Bayesian approach, it is important to emphasize 
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a) NECS Data 


Bayesian 


SNP 

Chr/Gene 

BFg 

BFa 

BFd 

BFr 

BFc 

BFmax 

rs850610 

1/Clorf203 

l.OeS 

8.0e4 

1.2e6 

1.5el 

1.5e3 

1.2e6 

rs7666974 

4/unknown 

7.6e2 

1.4e-l 

7.4el 

7.5el 

1.2e4 

1.2e4 

rs2333166 

4/unknown 

4.6e2 

3.9e2 

3.4e3 

3.8e-l 

1.5e3 

3.4e3 

rs28011S5 

l/ESRRG 

2.2e2 

4.4el 

4.9e-l 

2.0e3 

4.0e-l 

2.0e3 

rsl869676 

8/unknown 

1.4e3 

5.0e3 

l.Oel 

2.0e3 

3.7el 

5.0e3 

rs8064944 

17/unknown 

2.2e2 

l.Oel 

4.3e3 

2.9e-l 

4.4e-l 

4.3e3 

rs3746314 

19/C19orfl2 

1.8e3 

2.4e3 

9.4e2 

3.7el 

9.9e-l 

2.4e3 

rs9555139 

13/unknown 

6.5e2 

2.2e3 

6.7el 

1.5e2 

1.5el 

2.2e3 

rsl2770017 

10/unknown 

1.4e2 

3.0e-l 

1.9e3 

1.2e-l 

1.9el 

1.9e3 

rsl530239 

2/IKZF2 

7.9el 

5.8el 

7.0e-l 

2.lei 

5.9el 

5.9el 


Frequentist 


SNP 

Chr/Gene 

PHet 

PHcm 

Pa 

Pd 

Pr 

Pc 

Pmin 

rs850610 

1/Clorf203 

2.0e-7 

2.4e-4 

2.5e-7 

1.9e-8 

3.3e-2 

2.1e-5 

1.9e-8 

rs7666974 

4/unknown 

8.4e-4 

7.7e-l 

1.5e-l 

5.3e-2 

6.9e-4 

1.9e-6 

1.9e-6 

rs2333166 

4/unknown 

5.8e-6 

2.3e-l 

5.5e—5 

5.5e-6 

4.4e-l 

8.7e-6 

5.5e-6 

rs28011S5 

l/ESRRG 

7.3e-l 

3.1e-6 

1.9e-3 

l.Oe-1 

2.8e-6 

6.2e-l 

2.8e-6 

rsl869676 

8/unknown 

8.1e-l 

1.5e-l 

3.5e-4 

2.4e-l 

l.le-4 

3.2e-4 

l.le-4 

rs8064944 

17/unknown 

4.0e-4 

2.7e-4 

2.2e-2 

2.1e-4 

2.8e-l 

5.2e-l 

2.1e-4 

rs3746314 

19/C19orfl2 

2.2e-2 

1.5e-3 

8.8e-4 

2.2e-3 

6.0e-3 

l.Oe-1 

8.8e-4 

rs9555139 

13/unknown 

4.2e-l 

6.2e-2 

8.2e-3 

7.0e-2 

1.2e-2 

6.1e-2 

8.2e-3 

rsl2770017 

10/unknown 

2.6e-4 

l.le-3 

9.6e-l 

7.9e-4 

3.8e-l 

6.0e-2 

2.6e-4 

rsl530239 

2/IKZF2 

3.5e-2 

1.4e-3 

3.7e-6 

3.1e-3 

2.3e-5 

5.0e-4 

3.7e-6 


b) CSSCD Data _ 

Bayesian 


SNP 

Chr/Gene 

BFg 

BFa 

BFd 

BFr 

BFg 

B Fmax 

rs6709302 

2/BCLllA 

1.7e4 

1.9e5 

7.4e3 

3.8e2 

2.lei 

1.9e5 

rs7631659 

3/unknown 

1.2e4 

3.4el 

4.4e-l 

1.6e5 

2.2e-l 

1.6e5 

rsl3043968 

20/unknown 

1.9e3 

1.2el 

2.5e-l 

3.1e4 

2.1e-l 

3.1e4 

rs2239580 

14/COCH 

2.5e3 

l.le3 

3.0e4 

1.5e-l 

2.8e4 

3.0e4 

rs6932510 

6/RPS6KA2 

5.9e2 

2.1e3 

5.9e3 

3.1e-l 

3.7e3 

5.9e3 

rsl890911 

14/unknown 

2.5e2 

8.5e-l 

1.5e-l 

4.4e3 

3.3e-l 

4.4e3 

rsl2469604 

2/unknown 

6.1e2 

2.2e3 

2.9e3 

6.3e-l 

1.8e3 

2.9e3 

rs2034614 

12/PRIGKLEl 

1.8e2 

1.5e2 

1.7e3 

1.6e-l 

2.8e3 

2.Se3 

rs2301819 

4/TBC1D14 

8.4el 

5.0el 

1.9e2 

1.9e-l 

2.0e3 

2.0e3 

rs9642124 

7/unknown 

3.8el 

4.8e-2 

3.2el 

9.2e-l 

1.8e3 

1.8e3 

rsll794652 

9/FUBP3 

9.9el 

4.2e2 

5.2el 

6.0e2 

2.1e-l 

6.0e2 

rs7975463 

12/unknown 

3.5el 

1.3el 

1.8e2 

S.4e-2 

8.9e2 

8.9e2 


Frequentist 


SNP 

Chr/Gene 

PHet 

Pnom 

Pa 

Pd 

Pr 

Pc 

Pmin 

rs6709302 

2/BCLllA 

1.2e-4 

2.5e-7 

1.3e-8 

8.0e-7 

2.6e-5 

1.5e-2 

1.3e-8 

rs7631659 

3/unknown 

9.1e-l 

7.6e-8 

7.0e-3 

1.5e-l 

7.5e-8 

9.2e-l 

7.5e-8 

rsl3043968 

20/unknown 

8.9e-l 

5.4e-7 

2.1e-2 

2.8e-l 

4.9e-7 

6.7e-l 

4.9e-7 

rs2239580 

14/COCH 

1.7e-7 

2.1e-l 

4.9e-6 

2.3e-7 

5.7e-l 

3.2e-7 

1.7e-7 

rs6932510 

6/RPS6KA2 

4.9e-6 

2.5e-l 

6.4e—6 

3.3e-6 

3.9e-l 

6.5e—6 

3.3e-6 

rsl890911 

14/unknown 

6.7e-l 

4.7e-6 

1.5e-2 

3.7e-l 

2.6e-6 

2.5e-l 

2.6e-6 

rsl2469604 

2/unknown 

1.2e-5 

2.3e-l 

6.6e-6 

7.2e-6 

2.8e-l 

1.4e-5 

6.6e-6 

rs2034614 

12/PRICKLEl 

7.7e-6 

5.6e-l 

9.4e-5 

1.2e-5 

8.7e-l 

9.0e-6 

7.7e-6 

rs2301819 

4/TBC1D14 

1.9e-5 

7.0e-l 

3.5e-3 

1.3e-4 

3.7e-l 

1.3e-5 

1.3e-5 

rs9642124 

7/unknown 

4.0e-4 

S.7e-1 

9.8e-l 

1.6e-2 

1.4e-2 

1.6e-5 

1.6e—5 

rsll794652 

9/FUBP3 

9.9e-2 

3.3e-6 

7.9e-6 

2.8e-3 

l.le-5 

4.8e-l 

3.3e-6 

rs7975463 

12/unknown 

7.6e-6 

2.8e-l 

6.5e-3 

4.6e-5 

6.2e-l 

1.2e-5 

7.6e-6 


Table 6: Significant results using two approaches in the a) NECS and b) 
CSSCD data. BF=Bayes Factor ;P=p-value; G=genotypic; A=additive; D=donunant; 
R=recessive; C=co-doininant; Het=heterozygote genotype factor in the genotypic 
model; Hom=homozygote genotype factor in the genotypic model. SNPs that were 
significant in both approaches are highlighted in gray. 
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that the frequentist approach does not have a clear way to compare the genotypic model 
(2 degrees of freedom) to the other four specific genetic models (1 degree of freedom). 
The evaluation of the method in simulated data shows that the Bayesian method we 
propose has a slightly higher power when we limit to false positive rates at very small 
values and this is a particularly attractive property in genome-wide association studies 
in which the large number of SNPs analyzed requires to set the false positive rate to 
extremely small numbers. An additional attractive feature of this method is the gain 
in computations; The proposed method codes five genetic models simultaneously using 
a single polynomial parameterization instead of fitting five different genetic models for 
each SNP. This contrasts with the standard approach in which recoding of the SNP 
genotype and conducting 5 analyses is necessary to evaluate all five models concur¬ 
rently. 

An important theoretical implication of this particular parameterization is that it 
shows that different genetic models are functionally related. We have shown that there 
is a mathematical relationship between the parameters of the polynomial model and 
each of the five genetic models. This relation also suggests that when all five genetic 
models are evaluated the effective number of tests per SNP is less than 5. In practice, 
GWASs suffer from severe correction for multiple testing, and evaluation of several 
genetic models for each SNP aggravates this issue. However, our work suggests that the 
correction for multiple testing should be less severe as the effective number of tests is 
less than the number of models fitted, when evaluating all five genetic models, although 
it is not immediately obvious how Bonferroni type corrections should benefit from this 
result. 

The proposed work can be particularly useful for genome-wide data consisting of 
millions of SNPs. This work, at the current state, is limited to the case where the 
trait is quantitative, as we can obtain closed form solutions for the marginal likeli¬ 
hood and BF. More work is needed to evaluate a similar approach when the trait of 
interest is binary or time-to-event. In particular, when performing logistic regression 
in the GWAS context, alternative measures of associations such as approximate Bayes 
Factor or Bayesian false-discovery probability (Wakefield (2007, 2008, 2009)) can be 
considered. 


Supplementary Material 

Supplementary Materials: Bayesian Polynomial Regression Models to Fit Multiple Ge¬ 
netic Models for Quantitative Traits (DOT 10.I2I4/I4-BA880SUPP; .pdf). 
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